Kurtosis Test (Anscombe–Glynn)#

The kurtosis test asks a narrow but useful question:

  • Are the tails of this distribution consistent with a Normal distribution?

It is often used as a diagnostic for:

  • spotting heavy tails / outliers (more extreme values than Normal),

  • spotting light tails (bounded / overly thin-tailed data),

  • checking Normality assumptions in residuals (regression, time series, etc.).

This notebook explains what kurtosis means, how the test works, and how to interpret it, then implements the full Anscombe–Glynn kurtosis test from scratch using only NumPy (Plotly is used only for visuals).


Learning goals#

  • Understand Pearson kurtosis vs excess kurtosis

  • See how kurtosis reacts to tail events / outliers

  • Implement the kurtosis test (test statistic + p-value) with NumPy

  • Interpret the z-score sign, p-value, and common pitfalls

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from dataclasses import dataclass
from typing import Literal


pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

rng = np.random.default_rng(42)

1) What is kurtosis? (intuition)#

Kurtosis is a shape summary of a distribution that is most usefully interpreted as tail heaviness / outlier propensity.

Two common conventions:

  • Pearson kurtosis:

    \[b_2 = \frac{\mathbb{E}[(X-\mu)^4]}{(\mathbb{E}[(X-\mu)^2])^2}\]

    For a Normal distribution, \(b_2 = 3\).

  • Excess kurtosis (a.k.a. Fisher kurtosis):

    \[g_2 = b_2 - 3\]

    For a Normal distribution, \(g_2 = 0\).

Rule of thumb:

  • \(g_2 > 0\) (leptokurtic): heavier tails / more extremes than Normal

  • \(g_2 < 0\) (platykurtic): lighter tails / bounded behavior

Important: kurtosis is sometimes described as “peakedness”, but that can be misleading — the tails are the core story.

def sample_kurtosis_pearson(x: np.ndarray) -> float:
    """Biased Pearson kurtosis b2 = m4 / m2^2 (Normal -> 3).

    Uses central moments with denominator n (bias=True style).
    """
    x = np.asarray(x, dtype=float)
    x = x[np.isfinite(x)]
    n = x.size
    if n < 4:
        raise ValueError(f"Need >=4 finite observations, got {n}.")

    mean = x.mean()
    dev = x - mean
    m2 = np.mean(dev**2)
    if m2 == 0.0:
        return np.nan
    m4 = np.mean(dev**4)
    return float(m4 / (m2**2))


def standardize(x: np.ndarray) -> np.ndarray:
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std(ddof=0)
# Compare several distributions after standardizing to mean=0, std=1
n_big = 50_000
samples = {
    "Normal": rng.normal(size=n_big),
    "Laplace (heavy-ish tails)": rng.laplace(size=n_big),
    "t(df=3) (very heavy tails)": rng.standard_t(df=3, size=n_big),
    "Uniform (bounded)": rng.uniform(low=-1, high=1, size=n_big),
}

rows = []
for name, x in samples.items():
    xs = standardize(x)
    b2 = sample_kurtosis_pearson(xs)
    rows.append((name, b2, b2 - 3.0))

fig = go.Figure(
    data=[
        go.Table(
            header=dict(values=["Distribution", "Pearson kurtosis b2", "Excess kurtosis g2"]),
            cells=dict(values=[
                [r[0] for r in rows],
                [f"{r[1]:.3f}" for r in rows],
                [f"{r[2]:.3f}" for r in rows],
            ]),
        )
    ]
)
fig.update_layout(title="Kurtosis changes a lot across tail shapes (all standardized to mean=0, std=1)")
fig.show()
# Tail probability curves: P(|X| > t) highlights tail heaviness clearly
thresholds = np.linspace(0, 5, 101)

fig = go.Figure()
for name, x in samples.items():
    xs = standardize(x)
    probs = [(np.abs(xs) > t).mean() for t in thresholds]
    fig.add_trace(go.Scatter(x=thresholds, y=probs, mode="lines", name=name))

fig.update_layout(
    title="Tail heaviness via exceedance probability (log scale)",
    xaxis_title="threshold t (in standard deviations)",
    yaxis_title="P(|X| > t)",
    yaxis_type="log",
)
fig.show()

2) What does the kurtosis test do?#

The kurtosis test is a hypothesis test of whether the population kurtosis equals the Normal distribution’s kurtosis.

Let \(\beta_2\) denote the population Pearson kurtosis and \(b_2\) its sample estimate. The hypotheses are statements about the population, not about the sample value.

  • Null hypothesis: the data come from a population with Normal kurtosis.

    \[H_0: \; \beta_2 = 3\]
  • Alternative hypotheses:

    • two-sided: \(\beta_2 \neq 3\)

    • greater: \(\beta_2 > 3\) (heavier tails than Normal)

    • less: \(\beta_2 < 3\) (lighter tails than Normal)

The result is a z-score and a p-value:

  • The z-score sign indicates direction (heavy vs light tails).

  • The p-value is computed under \(H_0\); it is the probability of seeing a z-score at least as extreme as the observed one if the underlying distribution had Normal kurtosis.

⚠️ This is not a full Normality test: many non-Normal distributions still have \(\beta_2 \approx 3\); the sketch below constructs one.
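
For example, a minimal sketch (the mixture and its weights are hand-picked for illustration): mix a unit-variance Uniform (kurtosis 1.8) with a unit-variance Laplace (kurtosis 6) in proportions chosen so the fourth moments average to 3. The result is clearly non-Normal, yet its kurtosis is essentially Normal.

# Non-Normal but kurtosis ~= 3 (hand-picked mixture, for illustration only).
# Uniform(-sqrt(3), sqrt(3)) has b2 = 1.8; Laplace(scale=1/sqrt(2)) has b2 = 6.
# Both have variance 1, so fourth moments mix linearly; w solves 1.8*w + 6*(1 - w) = 3.
w = 3.0 / 4.2
n_mix = 50_000
from_unif = rng.random(n_mix) < w
x_mix = np.where(
    from_unif,
    rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size=n_mix),
    rng.laplace(scale=1.0 / np.sqrt(2.0), size=n_mix),
)
print("sample b2:", round(sample_kurtosis_pearson(x_mix), 3))  # ~3 despite the non-Normal shape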

3) The Anscombe–Glynn test statistic (what we implement)#

A naive large-sample approximation treats excess kurtosis \(g_2\) as Normal with variance \(24/n\), but the sampling distribution of \(b_2\) is noticeably skewed at realistic sample sizes, so that z-test gives poor p-values.

The Anscombe–Glynn method transforms \(b_2\) into a quantity \(Z\) that is approximately \(\mathcal{N}(0,1)\) under \(H_0\).

For sample size \(n\) and sample Pearson kurtosis \(b_2\):

  1. Under Normality, compute the expected value and variance of \(b_2\):

\[E[b_2] = 3 \frac{n-1}{n+1}\]
\[\mathrm{Var}(b_2) = \frac{24n(n-2)(n-3)}{(n+1)^2(n+3)(n+5)}\]
  2. Standardize:

\[x = \frac{b_2 - E[b_2]}{\sqrt{\mathrm{Var}(b_2)}}\]
  3. Apply the Anscombe–Glynn transform, a cube-root correction for the skewness of the sampling distribution of \(b_2\), to get \(Z\):

\[\sqrt{\beta_1(b_2)} = \frac{6(n^2 - 5n + 2)}{(n+7)(n+9)} \sqrt{\frac{6(n+3)(n+5)}{n(n-2)(n-3)}}\]
\[A = 6 + \frac{8}{\sqrt{\beta_1(b_2)}} \left( \frac{2}{\sqrt{\beta_1(b_2)}} + \sqrt{1 + \frac{4}{\beta_1(b_2)}} \right)\]
\[Z = \frac{1 - \dfrac{2}{9A} - \left( \dfrac{1 - 2/A}{1 + x\sqrt{2/(A-4)}} \right)^{1/3}}{\sqrt{\dfrac{2}{9A}}}\]

(The implementation below takes the real cube root when the base is negative.)

Then the p-value is computed from the standard Normal distribution.

Rule of thumb: p-values are most reliable for n > 20.

Alternative = Literal["two-sided", "less", "greater"]


def erf_approx(x: np.ndarray) -> np.ndarray:
    """Fast erf approximation (Abramowitz & Stegun 7.1.26).

    Vectorized NumPy implementation; max error ~1.5e-7.
    """
    x = np.asarray(x, dtype=float)
    sign = np.sign(x)
    ax = np.abs(x)

    p = 0.3275911
    a1 = 0.254829592
    a2 = -0.284496736
    a3 = 1.421413741
    a4 = -1.453152027
    a5 = 1.061405429

    t = 1.0 / (1.0 + p * ax)
    poly = (((((a5 * t + a4) * t) + a3) * t + a2) * t + a1) * t
    y = 1.0 - poly * np.exp(-(ax * ax))
    return sign * y


def normal_cdf(z: np.ndarray) -> np.ndarray:
    z = np.asarray(z, dtype=float)
    return 0.5 * (1.0 + erf_approx(z / np.sqrt(2.0)))


def normal_sf(z: np.ndarray) -> np.ndarray:
    return 1.0 - normal_cdf(z)


def pvalue_from_z(z: np.ndarray, alternative: Alternative) -> np.ndarray:
    z = np.asarray(z, dtype=float)
    if alternative == "two-sided":
        return 2.0 * normal_sf(np.abs(z))
    if alternative == "greater":
        return normal_sf(z)
    if alternative == "less":
        return normal_cdf(z)
    raise ValueError(f"Unknown alternative: {alternative!r}")


@dataclass(frozen=True)
class KurtosisTestResult:
    statistic: float
    pvalue: float
    n: int
    b2: float
    expected_b2: float
    var_b2: float
    alternative: str


def kurtosis_test_anscombe_glynn(
    x: np.ndarray,
    alternative: Alternative = "two-sided",
) -> KurtosisTestResult:
    """Anscombe–Glynn test of Normal kurtosis (NumPy-only).

    H0: population kurtosis equals Normal kurtosis (b2 = 3).

    Notes:
    - Requires at least 5 observations.
    - p-values may be inaccurate for n < 20 (rule of thumb).
    """
    x = np.asarray(x, dtype=float)
    x = x[np.isfinite(x)]
    n = int(x.size)
    if n < 5:
        raise ValueError(f"Need >=5 finite observations for the kurtosis test, got {n}.")

    b2 = sample_kurtosis_pearson(x)

    E = 3.0 * (n - 1.0) / (n + 1.0)
    varb2 = 24.0 * n * (n - 2.0) * (n - 3.0) / ((n + 1.0) ** 2 * (n + 3.0) * (n + 5.0))
    x_std = (b2 - E) / np.sqrt(varb2)

    # sqrt(beta1(b2)): skewness of the sampling distribution of b2 under H0
    sqrtbeta1 = (
        6.0 * (n * n - 5.0 * n + 2.0) / ((n + 7.0) * (n + 9.0))
        * np.sqrt(6.0 * (n + 3.0) * (n + 5.0) / (n * (n - 2.0) * (n - 3.0)))
    )
    # A parameterizes the cube-root (Wilson-Hilferty-style) correction below
    A = 6.0 + 8.0 / sqrtbeta1 * (2.0 / sqrtbeta1 + np.sqrt(1.0 + 4.0 / (sqrtbeta1**2)))

    term1 = 1.0 - 2.0 / (9.0 * A)
    denom = 1.0 + x_std * np.sqrt(2.0 / (A - 4.0))

    if denom == 0.0:
        z = np.nan
    else:
        term2 = np.sign(denom) * ((1.0 - 2.0 / A) / np.abs(denom)) ** (1.0 / 3.0)
        z = (term1 - term2) / np.sqrt(2.0 / (9.0 * A))

    p = float(pvalue_from_z(z, alternative=alternative))

    return KurtosisTestResult(
        statistic=float(z),
        pvalue=p,
        n=n,
        b2=float(b2),
        expected_b2=float(E),
        var_b2=float(varb2),
        alternative=alternative,
    )

What the z-score means#

  • Z > 0: sample kurtosis is larger than expected under Normality → heavier tails / more extremes

  • Z < 0: sample kurtosis is smaller than expected under Normality → lighter tails / bounded data

The p-value turns this into a decision rule via your chosen \(\alpha\) (e.g. 0.05).
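
A minimal usage sketch (the Laplace sample and \(\alpha = 0.05\) are arbitrary illustrative choices):

# Decision-rule sketch: alpha and the Laplace sample are arbitrary illustrative choices.
alpha = 0.05
res = kurtosis_test_anscombe_glynn(rng.laplace(size=500), alternative="two-sided")
direction = "heavier" if res.statistic > 0 else "lighter"
if res.pvalue <= alpha:
    print(f"reject H0: tails look {direction} than Normal (z={res.statistic:.2f}, p={res.pvalue:.3g})")
else:
    print(f"fail to reject H0 (z={res.statistic:.2f}, p={res.pvalue:.3g})")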

# Optional: compare our implementation to SciPy (should match closely)
try:
    from scipy.stats import kurtosistest as scipy_kurtosistest

    x_demo = rng.normal(size=200)
    ours = kurtosis_test_anscombe_glynn(x_demo)
    theirs = scipy_kurtosistest(x_demo)

    print('ours :', ours.statistic, ours.pvalue)
    print('scipy:', float(theirs.statistic), float(theirs.pvalue))
except Exception as e:
    print('SciPy comparison skipped:', e)
ours : 0.38188910789173036 0.7025437321638381
scipy: 0.3818891078917334 0.7025436197478272

4) Worked examples + interpretation#

We’ll test two samples:

  1. A truly Normal sample (we expect no strong evidence against \(H_0\) most of the time).

  2. A heavy-tailed sample (we expect evidence for \(b_2 > 3\)).

def standard_normal_pdf(z: np.ndarray) -> np.ndarray:
    z = np.asarray(z, dtype=float)
    return np.exp(-0.5 * z * z) / np.sqrt(2.0 * np.pi)


def plot_pvalue_shading(
    z: float,
    alternative: Alternative = "two-sided",
    title: str | None = None,
) -> go.Figure:
    xs = np.linspace(-4.5, 4.5, 800)
    ys = standard_normal_pdf(xs)

    fig = go.Figure()
    fig.add_trace(go.Scatter(x=xs, y=ys, mode="lines", name="N(0,1) pdf"))

    if alternative == "two-sided":
        z0 = float(abs(z))
        mask_left = xs <= -z0
        mask_right = xs >= z0
        fig.add_trace(
            go.Scatter(
                x=xs[mask_left],
                y=ys[mask_left],
                fill="tozeroy",
                mode="lines",
                name="left tail",
                line=dict(width=0),
            )
        )
        fig.add_trace(
            go.Scatter(
                x=xs[mask_right],
                y=ys[mask_right],
                fill="tozeroy",
                mode="lines",
                name="right tail",
                line=dict(width=0),
            )
        )
    elif alternative == "greater":
        mask = xs >= z
        fig.add_trace(
            go.Scatter(x=xs[mask], y=ys[mask], fill="tozeroy", mode="lines", name="upper tail", line=dict(width=0))
        )
    elif alternative == "less":
        mask = xs <= z
        fig.add_trace(
            go.Scatter(x=xs[mask], y=ys[mask], fill="tozeroy", mode="lines", name="lower tail", line=dict(width=0))
        )
    else:
        raise ValueError(f"Unknown alternative: {alternative!r}")

    fig.add_vline(x=z, line_width=2, line_dash="dash", line_color="black")

    p = float(pvalue_from_z(z, alternative=alternative))
    fig.update_layout(
        title=title or f"p-value shading for z={z:.3f} ({alternative}), p={p:.4g}",
        xaxis_title="z",
        yaxis_title="density",
        showlegend=True,
    )
    return fig
# Example A: Normal sample
x_norm = rng.normal(size=200)
res_norm = kurtosis_test_anscombe_glynn(x_norm, alternative="two-sided")
res_norm
KurtosisTestResult(statistic=-0.08531513314186485, pvalue=0.9320107311880461, n=200, b2=2.897119191236123, expected_b2=2.970149253731343, var_b2=0.11136036352709348, alternative='two-sided')
fig = px.histogram(
    x_norm,
    nbins=40,
    histnorm="probability density",
    title="Example A: sample histogram (Normal)",
)
fig.update_layout(xaxis_title="x", yaxis_title="density")
fig.show()

plot_pvalue_shading(res_norm.statistic, alternative=res_norm.alternative).show()
# Example B: heavy tails
x_heavy = rng.standard_t(df=3, size=200)
x_heavy = standardize(x_heavy)  # compare tail shape, not scale
res_heavy = kurtosis_test_anscombe_glynn(x_heavy, alternative="greater")
res_heavy
KurtosisTestResult(statistic=8.138267812060432, pvalue=2.220446049250313e-16, n=200, b2=22.666621285524464, expected_b2=2.970149253731343, var_b2=0.11136036352709348, alternative='greater')
fig = px.histogram(
    x_heavy,
    nbins=40,
    histnorm="probability density",
    title="Example B: sample histogram (heavy tails, standardized)",
)
fig.update_layout(xaxis_title="x", yaxis_title="density")
fig.show()

plot_pvalue_shading(res_heavy.statistic, alternative=res_heavy.alternative).show()

Interpreting the p-value (and what it does not mean)#

  • If p ≤ α (e.g. 0.05): reject \(H_0\) → the kurtosis is inconsistent with Normal kurtosis.

  • If p > α: fail to reject \(H_0\) → you do not have evidence that kurtosis differs from Normal at this sample size.

The p-value is not the probability that \(H_0\) is true. It is a statement about what would happen if \(H_0\) were true.
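
One way to see this concretely: if \(H_0\) is true, the p-value is approximately Uniform(0, 1), so the test rejects at level \(\alpha\) about a fraction \(\alpha\) of the time even though \(H_0\) holds. A quick simulation sketch (the repetition count and sample size are arbitrary settings):

# Under H0 the p-value is ~Uniform(0,1), so the rejection rate at alpha is ~alpha.
# reps=2000 and n=100 are arbitrary simulation settings.
alpha = 0.05
pvals = np.array(
    [kurtosis_test_anscombe_glynn(rng.normal(size=100)).pvalue for _ in range(2000)]
)
print(f"rejection rate under H0 at alpha={alpha}: {(pvals <= alpha).mean():.3f}")  # ~0.05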

# How sensitive is kurtosis to a few outliers?
base = rng.normal(size=200)
ks = list(range(0, 11))
b2s = []
for k in ks:
    x = base.copy()
    if k > 0:
        x[:k] = 8.0  # inject k identical extreme outliers
    b2s.append(sample_kurtosis_pearson(x))

fig = px.line(x=ks, y=b2s, markers=True, title="A few outliers can dominate kurtosis")
fig.update_layout(xaxis_title="number of injected outliers", yaxis_title="sample Pearson kurtosis b2")
fig.add_hline(y=3.0, line_dash="dash", line_color="black")
fig.show()
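
The test statistic inherits this sensitivity. A sketch with a single injected point (the value 8.0 is arbitrary):

# A single extreme point can flip the test's conclusion (8.0 is an arbitrary value).
clean = rng.normal(size=200)
dirty = clean.copy()
dirty[0] = 8.0
print("clean p-value:", kurtosis_test_anscombe_glynn(clean).pvalue)
print("dirty p-value:", kurtosis_test_anscombe_glynn(dirty).pvalue)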

5) Diagnostics: does Z look Normal under H0? (simulation)#

Under \(H_0\) and with decent sample sizes, the transformed statistic \(Z\) should be roughly \(\mathcal{N}(0,1)\).

We’ll simulate many Normal samples, compute \(Z\), and compare histograms to the standard Normal pdf.

from plotly.subplots import make_subplots


def kurtosis_test_statistic_vectorized(x: np.ndarray) -> np.ndarray:
    """Vectorized Z computation for many samples at once.

    x: shape (reps, n)
    returns: shape (reps,)
    """
    x = np.asarray(x, dtype=float)
    reps, n = x.shape

    mean = x.mean(axis=1, keepdims=True)
    dev = x - mean
    m2 = np.mean(dev**2, axis=1)
    m4 = np.mean(dev**4, axis=1)
    b2 = m4 / (m2**2)

    E = 3.0 * (n - 1.0) / (n + 1.0)
    varb2 = 24.0 * n * (n - 2.0) * (n - 3.0) / ((n + 1.0) ** 2 * (n + 3.0) * (n + 5.0))
    x_std = (b2 - E) / np.sqrt(varb2)

    # Same Anscombe–Glynn constants as in kurtosis_test_anscombe_glynn, computed once per n
    sqrtbeta1 = (
        6.0 * (n * n - 5.0 * n + 2.0) / ((n + 7.0) * (n + 9.0))
        * np.sqrt(6.0 * (n + 3.0) * (n + 5.0) / (n * (n - 2.0) * (n - 3.0)))
    )
    A = 6.0 + 8.0 / sqrtbeta1 * (2.0 / sqrtbeta1 + np.sqrt(1.0 + 4.0 / (sqrtbeta1**2)))

    term1 = 1.0 - 2.0 / (9.0 * A)
    denom = 1.0 + x_std * np.sqrt(2.0 / (A - 4.0))
    term2 = np.sign(denom) * ((1.0 - 2.0 / A) / np.abs(denom)) ** (1.0 / 3.0)

    z = (term1 - term2) / np.sqrt(2.0 / (9.0 * A))
    return z


reps = 6000
ns = [10, 20, 50, 200]

fig = make_subplots(rows=2, cols=2, subplot_titles=[f"n={n}" for n in ns])
xs = np.linspace(-4, 4, 400)
pdf = standard_normal_pdf(xs)

for i, n in enumerate(ns):
    x = rng.normal(size=(reps, n))
    z = kurtosis_test_statistic_vectorized(x)

    r = i // 2 + 1
    c = i % 2 + 1
    fig.add_trace(
        go.Histogram(
            x=z,
            nbinsx=60,
            histnorm="probability density",
            opacity=0.7,
            showlegend=False,
        ),
        row=r,
        col=c,
    )
    fig.add_trace(
        go.Scatter(
            x=xs,
            y=pdf,
            mode="lines",
            line=dict(color="black"),
            showlegend=False,
        ),
        row=r,
        col=c,
    )

fig.update_layout(title="Sampling distribution of Z under H0 (Normal data)")
fig.update_xaxes(title_text="Z")
fig.update_yaxes(title_text="density")
fig.show()

6) Practical guidance + pitfalls#

  • The test is most trustworthy for n > 20. With small samples, kurtosis is noisy and the approximation can be off.

  • The test is very sensitive to outliers. A single bad data point can flip the conclusion — sometimes that’s exactly what you want, sometimes it’s a data quality issue.

  • The test only targets kurtosis, not skewness. Combine with a skewness test or a broader normality test if you need overall Normality.

  • With very large \(n\), tiny deviations in kurtosis can be statistically significant; always check the estimated \(b_2\) / \(g_2\) and plots, not just the p-value (see the sketch below).
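
A sketch of that last point, using t(df=100) as a nearly-Normal stand-in (its excess kurtosis is \(6/(\mathrm{df}-4) \approx 0.06\)); the huge sample size is the point of the demonstration:

# Nearly-Normal data still gets flagged at huge n: the standard error of b2
# shrinks like sqrt(24/n), so even g2 ~ 0.06 becomes a large z-score.
x_large = rng.standard_t(df=100, size=200_000)
res = kurtosis_test_anscombe_glynn(x_large)
print(f"b2 = {res.b2:.3f} (vs 3), z = {res.statistic:.1f}, p = {res.pvalue:.2g}")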

Exercises#

  1. Modify kurtosis_test_anscombe_glynn to support axis= for 2D arrays.

  2. Create a mixture distribution with \(b_2 \approx 3\) but visibly non-Normal; verify that the kurtosis test often fails to reject.

  3. Compare the kurtosis test to D’Agostino’s \(K^2\) test (which combines skewness and kurtosis).

References#

  • F. J. Anscombe, W. J. Glynn (1983). Distribution of the kurtosis statistic b2 for normal samples. Biometrika 70(1): 227–234.

  • SciPy: scipy.stats.kurtosistest (implementation follows Anscombe–Glynn).